Zurich - 27 & 28 June 2022

Missing data mechanisms (recap)

Assuming the missing data mechanism (MCAR)

Methods to deal with missing data implicitly assume a missing data mechanism.

MCAR: the strictest assumption. In practice MCAR data are also the easiest to deal with.

  1. Analyze the observed sample only (this will result in unbiased estimates).
  2. Use an imputation method to boost power when the amount of missing data is too large.

Assuming the missing data mechanism (MAR)

MAR: less strict assumption. Most advanced missing data methods assume this mechanism (e.g. multiple imputation, FIML).

  • By including variables in the study that may explain the missing data, a MAR assumption becomes more plausible (as compared to MNAR).
  • These auxiliary variables may also help in dealing with the missing data.
  • Auxiliary variables: variables related to the probability of missing data or to the variable with missing data. Can be used as predictors in an imputation model or as covariates in a FIML model to improve estimations.

Assuming the missing data mechanism (MNAR)

MNAR: least strict assumption.

  • MNAR data are also referred to as non-ignorable, because these cannot be ignored without causing bias in results. MNAR data are more challenging to deal with.

Ad hoc missing data methods

Complete-case analysis

Complete-case analysis (CCA): only the cases with observed data for all variables involved are used in the analysis.

  • The most common and easiest way to deal with missing data.
  • The default method in many data analysis procedures.
  • Assumes MCAR.

Parameter estimates:

  • Mean is unbiased for MCAR
  • Regression weight/correlation unbiased for MCAR
  • Standard errors are overestimated

CCA

linear regression when MCAR

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.347    0.205       1.69  0.0935
## 2 X2             0.225    0.0922      2.44  0.0167
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.418     0.325      1.29   0.204
## 2 X2             0.245     0.170      1.44   0.157

CCA

linear regression when MAR

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.347    0.205       1.69  0.0935
## 2 X2             0.225    0.0922      2.44  0.0167
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  -0.0302     0.335   -0.0900   0.929
## 2 X2            0.0942     0.150    0.627    0.534
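The behaviour shown in the output above can be reproduced with a small simulation. Below is a minimal sketch in Python with numpy (the course itself uses R); the data, coefficients, and missingness rate are made up for illustration. Under MCAR, the complete-case estimates remain close to the true values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 0.3 + 0.25 * x + rng.normal(size=n)

# MCAR: each Y value is missing with probability 0.3, independent of the data
miss = rng.random(n) < 0.3
obs = ~miss

# Complete-case analysis: fit the regression on the observed rows only
X = np.column_stack([np.ones(obs.sum()), x[obs]])
beta = np.linalg.lstsq(X, y[obs], rcond=None)[0]
print(beta)  # (intercept, slope) should stay close to the true (0.3, 0.25)
```

Under MAR missingness that depends on Y itself (or a correlate of Y), the same complete-case fit would drift away from the true values, as in the second output block above.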

Imputation

Imputation: replacing a missing entry with a value.

Possible values to impute?

  • Mean
  • Regression estimate
  • Random value
  • Previous observation (longitudinal study)
  • Informed guess / estimate

Assumption of single imputation

  • Most important: the imputed value is treated as if it were an actually observed value.
  • Some methods assume MCAR (e.g. regression imputation, random value imputation)

Other examples:

  • All missing values are average (i.e. mean imputation)
  • People with missing data remain stable (i.e. LOCF)

Example data

gender    n   wgt n   wgt mean   wgt sd   prf n   prf mean   prf sd
Man      81      73     86.009    9.660      73      7.998    3.539
Woman    61      50     75.782    8.478      52      9.782    3.238

Mean imputation

  • Imputing the average value for all missing entries
  • Very easy method

Parameter estimates:

  • Mean is unbiased when MCAR
  • Regression weight/correlation never unbiased
  • Standard errors are underestimated
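These properties are easy to see in a small simulation. A minimal numpy sketch (simulated data, made-up coefficients): mean imputation keeps the mean intact but shrinks the variance and attenuates the correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.8, size=n)

y_mis = y.copy()
miss = rng.random(n) < 0.4          # MCAR missingness in y
y_mis[miss] = np.nan

# Mean imputation: fill every missing entry with the observed mean
y_imp = np.where(np.isnan(y_mis), np.nanmean(y_mis), y_mis)

r_full = np.corrcoef(x, y)[0, 1]
r_imp = np.corrcoef(x, y_imp)[0, 1]
print(round(r_full, 2), round(r_imp, 2))   # correlation is attenuated
print(round(y.std(), 2), round(y_imp.std(), 2))  # variance shrinks too
```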

Mean imputation

Example

Regression imputation

Conditional mean imputation: estimate the imputed value as the predicted value from a regression model: \(Y_{imp} = \beta_0 + \beta_1 * X\)

  • Mean is unbiased when MAR
  • Regression weight is unbiased when MAR
  • Correlation is never unbiased
  • Standard errors are underestimated
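A sketch of conditional mean imputation on simulated data (Python/numpy; coefficients and missingness rate are made up). Because every imputed value falls exactly on the regression line, the correlation is inflated rather than unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

miss = rng.random(n) < 0.3
obs = ~miss

# Fit the imputation model on the observed cases only
X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
b0, b1 = np.linalg.lstsq(X_obs, y[obs], rcond=None)[0]

# Impute the missing y's with their predicted (conditional mean) values
y_imp = y.copy()
y_imp[miss] = b0 + b1 * x[miss]

# Imputed values lie exactly on the regression line: correlation is inflated
print(round(np.corrcoef(x, y_imp)[0, 1], 2), round(np.corrcoef(x, y)[0, 1], 2))
```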

Regression imputation

Example

Imputation from regression equation: \(Performance = \beta_0 + \beta_1 * weight + \beta_2 * gender\)

Regression imputation + sampling error

Stochastic regression imputation: regression imputation, with additional sampling error added to the predicted value:

  • \(Y_{imp} = \beta_0 + \beta_1 * X + \epsilon\)

Sampling error is normally distributed.

  • Mean is unbiased when MAR
  • Regression weight is unbiased when MAR
  • Correlation is unbiased when MAR
  • Standard errors are underestimated

Imputation uncertainty is not taken into account
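The residual-error step can be sketched as follows (Python/numpy, simulated data; the residual SD is estimated from the observed cases). Adding a normal error draw to each prediction restores realistic spread in the imputed values.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

miss = rng.random(n) < 0.3
obs = ~miss

X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
b0, b1 = np.linalg.lstsq(X_obs, y[obs], rcond=None)[0]
resid = y[obs] - (b0 + b1 * x[obs])
sigma = resid.std(ddof=2)  # residual standard deviation of the fitted model

# Stochastic regression: predicted value plus a normally distributed error draw
y_imp = y.copy()
y_imp[miss] = b0 + b1 * x[miss] + rng.normal(scale=sigma, size=miss.sum())
print(round(y_imp[miss].std(), 2))  # imputed values keep realistic spread
```

Note that the regression parameters themselves are still treated as fixed here, which is why imputation (parameter) uncertainty is not yet taken into account.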

Regression imputation + residual error

Example

Imputation from regression equation: \(Performance = \beta_0 + \beta_1 * weight + \beta_2 * gender + \epsilon\)

Last observation carried forward

Ad hoc method for longitudinal data: use the previous observed value to impute the missing values.

Assumes that people that drop out of the study remain stable.

  • Mean, regression coefficient and correlation are never unbiased.
  • Standard errors are underestimated
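LOCF itself is mechanically simple. A toy sketch (Python/numpy; the weights below are invented, with NaN marking drop-out):

```python
import numpy as np

# Toy longitudinal weights (rows: subjects; columns: 0, 3, 6, 12, 24 months);
# NaN marks drop-out, after which nothing is observed
w = np.array([
    [90.0, 88.0, 86.0, 85.0, 84.0],
    [95.0, 93.0, np.nan, np.nan, np.nan],
    [88.0, np.nan, np.nan, np.nan, np.nan],
])

# LOCF: carry the last observed value forward over the missing entries
locf = w.copy()
for t in range(1, locf.shape[1]):
    gap = np.isnan(locf[:, t])
    locf[gap, t] = locf[gap, t - 1]

print(locf[1])  # second subject stays frozen at the 3-month value
```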

Example data

  • An RCT for a weight loss intervention.
  • Measurements at baseline, mid-treatment (3m), post-treatment (6m), and two follow-up moments (12m and 24m).
  • Missing data occurs at the measurements after the baseline, due to study drop-out.

The missing data pattern below shows, per row, the number of cases with that pattern (1 = observed, 0 = missing); the last column counts the missing entries per pattern, and the bottom row the missing entries per variable.

##    id group  0  3  6 12 24   
## 76  1     1  1  1  1  1  1  0
## 16  1     1  1  1  1  1  0  1
##  9  1     1  1  1  1  0  0  2
##  7  1     1  1  1  0  0  0  3
##  2  1     1  1  0  0  0  0  4
##     0     0  0  2  9 18 34 63

LOCF

Example imputed

The trajectories over time, with LOCF imputations in red.

LOCF

Example imputed

Below only the cases with missing observations, separated by group.

Multiple imputation

Multiple imputation idea

  • Imputing the missing data entries with multiple “plausible” values.
  • Takes imputation uncertainty into account
  • A method to improve the main analysis results (not a method to complete or fill in the data)
    • Unbiased estimates
    • Improved precision and thus power
  • R package: mice

Multiple Imputation process

  • Imputation phase
  • Analysis phase
  • Pooling phase
  1. incomplete data
  2. generate multiple copies of the same dataset, each with different imputed values
  3. analyze each imputed dataset
  4. pool results for analyses to final study result
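The four steps above can be sketched as a toy loop (Python/numpy, simulated data; this is a deliberately simplified stand-in for what `mice` does in R, using stochastic regression imputation for every copy):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 200, 5
x = rng.normal(size=n)
y = 0.3 + 0.25 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3
obs = ~miss

# Imputation model fitted on the observed cases
X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
b0, b1 = np.linalg.lstsq(X_obs, y[obs], rcond=None)[0]
sigma = (y[obs] - (b0 + b1 * x[obs])).std(ddof=2)

slopes = []
for _ in range(m):
    # 1.-2. generate a completed copy with a fresh random error draw
    y_imp = y.copy()
    y_imp[miss] = b0 + b1 * x[miss] + rng.normal(scale=sigma, size=miss.sum())
    # 3. analyze each completed dataset with the substantive model
    X = np.column_stack([np.ones(n), x])
    slopes.append(np.linalg.lstsq(X, y_imp, rcond=None)[0][1])

# 4. pool: the point estimate is the mean over the m analyses
print(round(float(np.mean(slopes)), 3))
```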

Imputation phase

  • Imputing the missing data entries with multiple “plausible” values
  • Specify the imputation model
    • method to estimate imputed values
    • predictors used in the model

Algorithm for imputations

  • Fully Conditional Specification (FCS): imputes variable-by-variable.
  • For each variable the imputation model can be different:
    • method
    • predictors
  • Joint modelling: imputes from a multivariate distribution

General method

  • Impute missing value by using the predicted value estimated from observed data.
  • Add residual error to simulate sampling error
  • The imputation method is chosen depending on the type of variable that needs to be imputed.

More on the methods later

Predictors for imputation

Imputation model

  • For each variable, the most relevant predictors can be indicated
  • Use variables that are in the analysis model also in the imputation model (compatibility)
  • Add auxiliary variables
  • Advised to use no more than 25 predictors (also depends on sample size)

Auxiliary variables: variables related to the probability of missing data or to the variable with missing data.

Iterations

  • One iteration consists of one cycle through all variables that need to be imputed.
    • The algorithm starts with a random draw from the observed data.
    • Then imputes the incomplete data in a variable-by-variable fashion.
  • Typically 5 to 10 iterations before an imputed dataset is generated.
  • Iterations are repeated to ensure convergence of the algorithm
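The variable-by-variable cycling can be sketched as follows (Python/numpy, two incomplete simulated variables; a toy stand-in for FCS as implemented in `mice`, with stochastic regression as the per-variable method):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)
mx = rng.random(n) < 0.2          # missingness in x
my = rng.random(n) < 0.2          # missingness in y

# Start: fill each incomplete variable with random draws from its observed data
xi, yi = x.copy(), y.copy()
xi[mx] = rng.choice(x[~mx], mx.sum())
yi[my] = rng.choice(y[~my], my.sum())

def impute(target, mis, pred):
    """Stochastic regression of the target on the predictor, refit each cycle."""
    X = np.column_stack([np.ones(n), pred])
    b = np.linalg.lstsq(X[~mis], target[~mis], rcond=None)[0]
    s = (target[~mis] - X[~mis] @ b).std(ddof=2)
    target[mis] = X[mis] @ b + rng.normal(scale=s, size=mis.sum())
    return target

# One iteration = one cycle over all incomplete variables; repeat until stable
for _ in range(10):
    xi = impute(xi, mx, yi)
    yi = impute(yi, my, xi)
print(round(np.corrcoef(xi, yi)[0, 1], 2))
```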

Iteration plot

  • To check the convergence of the algorithm

Iteration plot

  • Use additional iterations when in doubt if convergence is reached

Iteration plot

  • Convergence may need more iterations

Iteration plot

  • Looks better for gen and tv, but there seems to be a problem for wgt.

Adapting imputation model

  • Changes in the imputation model may be needed when:
    • two variables are highly correlated (multi-collinearity)
    • a variable has so many missing values that the imputations become unstable

Number of imputations

  • Theoretically, higher \(m\) is better.
  • Usually, use \(m = 5\) for model building and \(m \approx\) percentage of missing data for the final analysis.
  • For finite \(m\), the between-imputation variance is enlarged by the factor \(1 + 1/m\) when computing the total variance.

Analysis phase

  • Each imputed dataset is analyzed, with the substantive analysis model

  • This results in \(m\) sets of results

  • Workflow:

    • Imputation
    • Analysis
    • Pooling

Analysis example

## # A tibble: 10 x 6
##    term        estimate std.error statistic   p.value  nobs
##    <chr>          <dbl>     <dbl>     <dbl>     <dbl> <int>
##  1 (Intercept)   19.5      5.70        3.42 0.000792    153
##  2 Solar.R        0.122    0.0275      4.44 0.0000169   153
##  3 (Intercept)   21.5      5.65        3.80 0.000210    153
##  4 Solar.R        0.107    0.0273      3.93 0.000127    153
##  5 (Intercept)   21.4      5.53        3.87 0.000162    153
##  6 Solar.R        0.106    0.0271      3.90 0.000142    153
##  7 (Intercept)   23.1      5.85        3.95 0.000120    153
##  8 Solar.R        0.110    0.0283      3.89 0.000149    153
##  9 (Intercept)   22.3      5.80        3.84 0.000177    153
## 10 Solar.R        0.117    0.0281      4.17 0.0000513   153

Pooling phase

  • Pool the analysis results to obtain final parameter estimates.
  • For normally distributed parameters: Rubin’s Rules

Pooling example

## Class: mipo    m = 5 
##          term m   estimate         ubar            b            t dfcom
## 1 (Intercept) 5 21.5600906 3.257524e+01 1.803273e+00 3.473916e+01   151
## 2     Solar.R 5  0.1124339 7.638285e-04 4.843572e-05 8.219513e-04   151
##         df        riv     lambda        fmi
## 1 123.0708 0.06642860 0.06229072 0.07716663
## 2 118.0594 0.07609413 0.07071327 0.08606584

Pool output

  • riv: relative increase in variance due to nonresponse
  • df: residual degrees of freedom for hypothesis testing
  • lambda: proportion of total variance due to missingness
  • fmi: fraction of missing information

Rubin’s Rules

  • Pooling of point estimates that are normally distributed over the imputed datasets.

  • Means, standard deviations, regression estimates, linear predictors, proportions.

  • For pooling point estimates, use mean:

    \(\hat\theta = \frac{1}{m}\sum^m_{i=1}{\hat\theta_i}\)

  • For pooling the variance or standard error around the estimate, combine the within- and between-imputation variance.

Variance pooling (Rubin’s Rules)

  • Between variance:

    \(\sigma^2_{between} = \frac{\sum^m_{i=1}(\beta_i - \overline\beta)^2}{m-1}\)

  • Within variance:

    \(\sigma^2_{within} = \frac{\sum^m_{i=1}\sigma^2_i}{m}\)

  • Total variance:

    \(\sigma^2_{total} = \sigma^2_{within} + \sigma^2_{between} + \frac{\sigma^2_{between}}{m}\)
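Applying these formulas to the (rounded) Solar.R estimates read off the analysis example above approximately reproduces the pooled `mipo` output; a quick check in Python:

```python
import numpy as np

# Rounded Solar.R estimates and standard errors from the m = 5 analyses above
est = np.array([0.122, 0.107, 0.106, 0.110, 0.117])
se2 = np.array([0.0275, 0.0273, 0.0271, 0.0283, 0.0281]) ** 2
m = len(est)

pooled = est.mean()                              # pooled point estimate
within = se2.mean()                              # average sampling variance (ubar)
between = ((est - pooled) ** 2).sum() / (m - 1)  # variance across imputations (b)
total = within + between + between / m           # Rubin's total variance (t)
print(round(pooled, 4), round(np.sqrt(total), 4))
```

The pooled estimate (≈ 0.1124) and total variance (≈ 8.2e-04) match the `estimate` and `t` columns of the pooling example.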

Pooling of non-normal parameters

  • Use a transformation to approximate a normal distribution
  • Examples:
    • Correlation: Fisher \(z\)
    • Odds Ratio, Relative Risk, Hazard ratio: log transformation
    • Explained variance: Fisher \(z\) for \(\sqrt{R^2}\)
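For a correlation, the transform-pool-backtransform cycle looks like this (Python/numpy; the three correlations are hypothetical values standing in for results from three imputed datasets):

```python
import numpy as np

# Correlations from m = 3 imputed datasets (hypothetical values)
r = np.array([0.42, 0.47, 0.44])

# Transform to Fisher z, pool on the z scale, transform back
z = np.arctanh(r)          # Fisher z transformation
r_pooled = np.tanh(z.mean())
print(round(float(r_pooled), 3))
```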

Imputation methods

Multiple imputation process

  • Imputation phase
  • Analysis phase
  • Pooling phase

In the imputation phase, imputed values are estimated using an imputation method.

Methods for continuous variables

  • Regression imputation method = norm.predict
  • Stochastic regression imputation (regression + residual error) method = norm.nob
  • Stochastic regression with parameter uncertainty method = norm
  • Predictive mean matching method = pmm

Example data

         vars   n       mean        sd  median    trimmed       mad  min   max  range       skew   kurtosis        se
Ozone       1 116  42.129310 32.987884    31.5  37.797872  25.94550  1.0 168.0    167  1.2098656  1.1122431 3.0628482
Solar.R     2 146 185.931507 90.058422   205.0 190.338983  98.59290  7.0 334.0    327 -0.4192893 -1.0040581 7.4532881
Wind        3 153   9.957516  3.523001     9.7   9.869919   3.40998  1.7  20.7     19  0.3410275  0.0288647 0.2848178
Temp        4 153  77.882353  9.465270    79.0  78.284553   8.89560 56.0  97.0     41 -0.3705073 -0.4628929 0.7652217
Month       5 153   6.993464  1.416522     7.0   6.991870   1.48260  5.0   9.0      4 -0.0023448 -1.3167465 0.1145191
Day         6 153  15.803922  8.864520    16.0  15.804878  11.86080  1.0  31.0     30  0.0026001 -1.2224406 0.7166540

Regression imputation

Algorithm

  • Imputed value: \(Y_{imp} = \hat{\beta}_0 + X_{mis}\hat{\beta}_1\)

  • Parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are estimated from the observed data.

Regression imputation

Concept

Regression imputation

Univariate imputation

Regression imputation

Multivariate imputation

Stochastic regression imputation

Algorithm

  • Imputed value: \(Y_{imp} = \hat{\beta}_0 + X_{mis}\hat{\beta}_1 + \epsilon\)

  • Parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are estimated from the observed data.

  • \(\epsilon\) is normally distributed residual error

Stochastic regression imputation

Concept

Stochastic regression imputation

Univariate

Stochastic regression imputation

Multivariate

Bayesian regression imputation

Algorithm

  • Imputed value: \(Y_{imp} = \dot{\beta}_0 + X_{mis}\dot{\beta}_1 + \epsilon\)

  • Parameters \(\dot{\beta}_0\) and \(\dot{\beta}_1\) are drawn from their posterior distribution.

  • \(\epsilon\) is normally distributed residual error
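A sketch of the parameter draw (Python/numpy, simulated data). Assuming an approximately normal posterior for the coefficients, the draw uses the least-squares estimates and their covariance; a full Bayesian treatment in `mice` (method `norm`) also draws the residual variance.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 300
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3
obs = ~miss

X = np.column_stack([np.ones(n), x])
b_hat = np.linalg.lstsq(X[obs], y[obs], rcond=None)[0]
resid = y[obs] - X[obs] @ b_hat
sigma2 = (resid ** 2).sum() / (obs.sum() - 2)

# Draw the coefficients from their (approximate normal) posterior instead of
# using the point estimates, so parameter uncertainty enters the imputation
cov = sigma2 * np.linalg.inv(X[obs].T @ X[obs])
b_dot = rng.multivariate_normal(b_hat, cov)

y_imp = y.copy()
y_imp[miss] = X[miss] @ b_dot + rng.normal(scale=np.sqrt(sigma2), size=miss.sum())
print(np.round(b_dot, 2))
```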

Bayesian regression imputation

Concept

Bayesian regression imputation

Univariate

Bayesian regression imputation

Multivariate

Predictive mean matching

Algorithm

  • Estimate the predicted value using Bayesian regression imputation
    • \(Y_{imp} = \dot{\beta}_0 + X_{mis}\dot{\beta}_1 + \epsilon\)
    • Parameters \(\dot{\beta}_0\) and \(\dot{\beta}_1\) are drawn from their posterior distribution.
    • \(\epsilon\) is normally distributed residual error
  • Select \(k\) nearest neighbors to this predicted value from the observed data
  • Randomly draw one donor to use as imputed value
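The matching and donor-draw steps can be sketched as follows (Python/numpy, simulated data). This is simplified: it uses the least-squares point estimates rather than a posterior draw, but the key PMM property survives — every imputed value is a genuinely observed value.

```python
import numpy as np

rng = np.random.default_rng(11)
n, k = 300, 5
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3
obs = ~miss

# Simplified: point estimates instead of a posterior draw for the coefficients
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X[obs], y[obs], rcond=None)[0]
pred = X @ b

y_obs, pred_obs = y[obs], pred[obs]
y_imp = y.copy()
for i in np.where(miss)[0]:
    # the k observed cases whose predicted values are closest to case i's
    donors = np.argsort(np.abs(pred_obs - pred[i]))[:k]
    # randomly draw one donor and impute its *observed* value
    y_imp[i] = y_obs[donors[rng.integers(k)]]
```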

Predictive mean matching

Concept

Predictive mean matching

Concept

Predictive mean matching

Univariate

Predictive mean matching

Multivariate

Methods for categorical data

  • Bayesian logistic regression method = logreg
  • Bayesian Polytomous regression method = polyreg
  • Classification and regression trees method = cart

Bayesian logistic regression

Algorithm

  • Imputed value: \(\log\frac{P(Y_{mis}=1)}{1-P(Y_{mis}=1)} = \dot{\beta}_0 + X_{mis}\dot{\beta}_1\)

  • Parameters \(\dot{\beta}_0\) and \(\dot{\beta}_1\) are drawn from their posterior distribution.

  • The imputed category is drawn from the predicted probability, to account for sampling variance

Bayesian logistic regression

Concept

Bayesian polytomous regression

Algorithm

  • Fit a multinomial regression model.

  • Parameters are drawn from their posterior distribution (Bayesian).

  • Compute the predicted category.

  • Draw the imputed category from the predicted probabilities, to account for sampling variance.

Bayesian polytomous regression

Concept

Classification and regression trees

Algorithm

  • Draw a random bootstrap sample as training set.
  • Fit the tree model using the observed data.
  • Find the predicted terminal node for the missing value
  • Use the observed cases at the predicted terminal node as donors
  • Randomly draw the imputed value from the observed donor cases
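The terminal-node/donor idea can be sketched with a deliberately tiny "tree" (Python/numpy, simulated data). The single split on the sign of \(x\) is hand-picked here; a real CART (mice method `cart`) learns the splits from a bootstrap sample, but the donor-draw step is the same.

```python
import numpy as np

rng = np.random.default_rng(12)
n = 200
x = rng.normal(size=n)
y = np.where(x > 0, 5.0, 1.0) + rng.normal(scale=0.5, size=n)
miss = rng.random(n) < 0.25
obs = ~miss

# Toy one-split "tree": the terminal node is determined by the sign of x.
# A real CART would learn the splits from a bootstrap training sample.
leaf = (x > 0).astype(int)

y_imp = y.copy()
for i in np.where(miss)[0]:
    # donors = observed cases that fall into the same terminal node
    donors = y[obs & (leaf == leaf[i])]
    y_imp[i] = rng.choice(donors)   # draw the imputed value from the donors
```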

Classification and regression trees

Concept

Classification and regression trees

Univariate

Classification and regression trees

Multivariate